• Introduction
    • What's Covered
    • Additional Resources
    • Libraries and Data
  • Unsupervised Learning in R
  • Hierarchical clustering
  • Dimensionality reduction with PCA
  • Putting it all together with a case study

Introduction


What's Covered

  • Unsupervised learning in R
  • Hierarchical clustering
  • Dimensionality reduction with PCA
  • Putting it all together with a case study

Additional Resources

Libraries and Data

source('create_datasets.R')

library(readr)
library(dplyr)
library(ggplot2)
library(stringr)

   


Unsupervised Learning in R


Welcome to the course!

3 main types of machine learning:

  • unsupervised learning
    • goal is to find structure in unlabeled data
    • unlabeled data is data with no targets
  • supervised learning
    • regression or classification
    • goal is to predict the amount or the label
  • reinforcement learning
    • a computer learns by feedback from operating in a real or synthetic environment

Two major goals:

  • find homogeneous subgroups within a population
    • these subgroups are called clusters
    • example: segmenting a market of consumers based on demographic features and purchasing history
    • example: finding similar movies based on features of each movie and reviews of the movies
  • find patterns in the features of the data
    • dimensionality reduction - a method to decrease the number of features needed to describe an observation while maintaining the maximum information content under the constraints of lower dimensionality

dimensionality reduction

  • find patterns in the features of the data
  • visualization of high-dimensional data
    • past 3 or 4 dimensions, visualizations are hard to produce and hard to consume
  • pre-processing step for supervised learning

Introduction to k-means clustering

  • an algorithm used to find homogeneous subgroups in a population
  • kmeans() comes in base R and needs:
    • the data
    • the number of centers (groups)
    • the number of runs: the algorithm starts by randomly assigning points to groups, so it can get stuck in a local minimum; running it multiple times helps you find the global minimum
  • you can run kmeans() many times to estimate the number of subgroups when it is not known a priori

– k-means clustering

str(x)
##  num [1:300, 1:2] 3.37 1.44 2.36 2.63 2.4 ...
head(x)
##          [,1]     [,2]
## [1,] 3.370958 1.995379
## [2,] 1.435302 2.760242
## [3,] 2.363128 2.038991
## [4,] 2.632863 2.735072
## [5,] 2.404268 1.853527
## [6,] 1.893875 1.942113
# Create the k-means model: km.out
km.out <- kmeans(x, centers = 3, nstart = 20)

# Inspect the result
summary(km.out)
##              Length Class  Mode   
## cluster      300    -none- numeric
## centers        6    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

– Results of kmeans()

# Print the cluster membership component of the model
km.out$cluster
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 3 3 3 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [246] 1 1 1 1 1 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 2
## [281] 3 3 3 3 3 3 2 3 3 3 3 3 3 2 3 3 3 2 3 3
# Print the km.out object
km.out
## K-means clustering with 3 clusters of sizes 150, 98, 52
## 
## Cluster means:
##         [,1]        [,2]
## 1 -5.0556758  1.96991743
## 2  2.2171113  2.05110690
## 3  0.6642455 -0.09132968
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 3 3 3 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [246] 1 1 1 1 1 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 2
## [281] 3 3 3 3 3 3 2 3 3 3 3 3 3 2 3 3 3 2 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 295.16925 148.64781  95.50625
##  (between_SS / total_SS =  87.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

– Visualizing and interpreting results of kmeans()

# Scatter plot of x
plot(x, 
  col = km.out$cluster,
  main = "k-means with 3 clusters",
  xlab = "",
  ylab = "")

How kmeans() works and practical matters

Process of k-means:

  • randomly assign all points to a cluster
  • calculate the center of each cluster
  • reassign each point to the cluster of the nearest center
  • if no points changed cluster, done; otherwise repeat:
  • calculate new centers based on the new assignments
  • reassign points to the cluster of the nearest center
  • and so on (a bare-bones sketch of this loop is below)
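For intuition only, here is a bare-bones sketch of that loop in R. kmeans_sketch is a hypothetical helper of my own, it ignores the empty-cluster edge case, and kmeans() in base R is the real implementation:

# Sketch of the k-means loop described above (illustration only)
kmeans_sketch <- function(x, k, max_iter = 100) {
  cluster <- sample(k, nrow(x), replace = TRUE)           # random initial assignment
  for (i in seq_len(max_iter)) {
    centers <- apply(x, 2, tapply, cluster, mean)          # center of each cluster
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]   # point-to-center distances
    new_cluster <- apply(d, 1, which.min)                  # nearest center wins
    if (all(new_cluster == cluster)) break                 # nothing changed: done
    cluster <- new_cluster
  }
  cluster
}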

model selection:

  • the best outcome is based on the total within-cluster sum of squares
  • run many times to get the global optimum
  • R will automatically keep the run with the lowest total withinss

determining number of clusters

  • scree plot
  • look for the elbow
  • find where the addition of a new cluster does not improve the best withinss much
  • there usually is no clear elbow in real-world data

– Handling random algorithms

# Set up 2 x 3 plotting grid
par(mfrow = c(2, 3))

# Set seed
set.seed(1)

for(i in 1:6) {
  # Run kmeans() on x with three clusters and one start
  km.out <- kmeans(x, centers = 3, nstart = 1)
  
  # Plot clusters
  plot(x, col = km.out$cluster, 
       main = km.out$tot.withinss, 
       xlab = "", ylab = "")
}

– Selecting number of clusters

# Initialize total within sum of squares error: wss
wss <- 0

# For 1 to 15 cluster centers
for (i in 1:15) {
  km.out <- kmeans(x, centers = i, nstart = 20)
  # Save total within sum of squares to wss variable
  wss[i] <- km.out$tot.withinss
}

# Plot total within sum of squares vs. number of clusters
plot(1:15, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")

# Set k equal to the number of clusters corresponding to the elbow location
k <- 2

Introduction to the Pokemon data

  • Data hosted on kaggle here
  • More pokemon info here

Data challenges:

  • selecting the variables to cluster on
  • scaling the data (we will cover this)
  • determining the true number of clusters
    • in real-world data a nice clean elbow in the scree plot does not usually exist
  • visualizing the results for interpretation

– Practical matters: working with real data

pokemon_raw <- read_csv('data/Pokemon.csv')
head(pokemon_raw)
## # A tibble: 6 x 13
##     `#`                  Name `Type 1` `Type 2` Total    HP Attack Defense
##   <int>                 <chr>    <chr>    <chr> <int> <int>  <int>   <int>
## 1     1             Bulbasaur    Grass   Poison   318    45     49      49
## 2     2               Ivysaur    Grass   Poison   405    60     62      63
## 3     3              Venusaur    Grass   Poison   525    80     82      83
## 4     3 VenusaurMega Venusaur    Grass   Poison   625    80    100     123
## 5     4            Charmander     Fire     <NA>   309    39     52      43
## 6     5            Charmeleon     Fire     <NA>   405    58     64      58
## # ... with 5 more variables: `Sp. Atk` <int>, `Sp. Def` <int>,
## #   Speed <int>, Generation <int>, Legendary <chr>
pokemon <- pokemon_raw %>% select(6:11)
head(pokemon)
## # A tibble: 6 x 6
##      HP Attack Defense `Sp. Atk` `Sp. Def` Speed
##   <int>  <int>   <int>     <int>     <int> <int>
## 1    45     49      49        65        65    45
## 2    60     62      63        80        80    60
## 3    80     82      83       100       100    80
## 4    80    100     123       122       120    80
## 5    39     52      43        60        50    65
## 6    58     64      58        80        65    80
# Initialize total within sum of squares error: wss
wss <- 0

# Look over 1 to 15 possible clusters
for (i in 1:15) {
  # Fit the model: km.out
  km.out <- kmeans(pokemon, centers = i, nstart = 20, iter.max = 50)
  # Save the within cluster sum of squares
  wss[i] <- km.out$tot.withinss
}

# Produce a scree plot
plot(1:15, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")

# Select number of clusters
k <- 4

# Build model with k clusters: km.out
km.pokemon <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)

# View the resulting model
km.pokemon
## K-means clustering with 4 clusters of sizes 114, 115, 288, 283
## 
## Cluster means:
##         HP    Attack   Defense   Sp. Atk  Sp. Def     Speed
## 1 89.20175 121.09649  92.73684 120.45614 97.67544 100.44737
## 2 71.30435  92.91304 121.42609  63.89565 88.23478  52.36522
## 3 79.18056  81.31944  69.19097  82.01042 77.53125  80.10417
## 4 50.29682  54.03180  51.62898  47.90459 49.15548  49.74912
## 
## Clustering vector:
##   [1] 4 3 3 1 4 3 3 1 1 4 3 3 1 4 4 3 4 4 3 3 4 4 3 1 4 3 4 3 4 3 4 3 4 2 4
##  [36] 4 3 4 4 3 4 3 4 3 4 3 4 3 4 3 3 4 2 4 3 4 3 4 3 4 3 4 3 4 1 4 4 3 4 3
##  [71] 3 1 4 3 2 4 3 3 4 3 4 2 2 3 3 4 2 2 4 3 4 4 3 4 3 4 3 4 2 4 3 3 1 2 4
## [106] 3 4 2 4 3 4 3 4 2 3 2 4 4 2 4 2 3 3 3 1 4 3 4 3 4 3 3 3 3 3 3 2 1 3 4
## [141] 3 1 3 4 4 3 3 1 3 4 2 4 2 3 1 3 1 1 1 4 3 1 1 1 1 1 4 3 3 4 3 3 4 3 3
## [176] 4 3 4 3 4 3 4 4 3 4 3 4 4 4 4 3 4 3 4 4 3 1 2 4 3 2 3 4 4 3 4 4 3 3 4
## [211] 2 3 2 3 3 3 4 3 3 4 2 3 2 2 2 4 3 3 2 2 2 3 2 3 4 3 4 2 4 3 4 4 3 4 3
## [246] 2 4 3 1 3 4 2 3 3 4 4 2 4 4 4 3 3 1 1 2 4 3 1 1 1 1 1 4 3 3 1 4 3 1 1
## [281] 4 3 3 1 4 3 4 3 4 4 3 4 4 4 4 3 4 4 3 4 3 4 3 4 4 3 1 4 3 4 3 4 3 1 4
## [316] 3 4 4 4 3 4 3 4 2 4 4 4 2 4 2 4 2 2 2 4 3 3 4 3 1 3 3 3 3 3 4 3 4 3 1
## [351] 3 3 4 3 1 2 4 3 4 4 4 3 4 3 4 3 1 3 3 3 3 4 3 4 3 4 2 4 2 4 2 4 3 3 2
## [386] 4 3 1 4 2 3 3 3 1 4 4 3 1 4 3 3 4 2 2 2 4 4 2 1 1 4 2 2 1 2 2 2 1 1 1
## [421] 1 1 1 1 1 1 1 1 1 1 2 1 4 2 2 4 3 3 4 3 3 4 4 3 4 3 4 4 4 4 3 4 3 4 3
## [456] 2 2 4 2 2 2 3 4 2 3 4 3 4 3 4 3 3 4 3 4 3 1 3 3 4 3 4 4 3 4 2 4 4 4 3
## [491] 2 4 3 1 1 3 4 1 1 4 2 4 2 4 3 3 4 3 4 4 3 1 3 2 2 2 2 1 1 3 3 2 3 2 3
## [526] 3 3 1 2 2 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 2 3 1 1 1 1 1 1 4 3 3 4 3 3
## [561] 4 3 3 4 3 4 4 2 4 3 4 3 4 3 4 3 4 3 4 4 3 4 3 4 2 2 4 3 4 3 3 2 4 2 2
## [596] 4 3 3 2 3 4 2 3 4 4 3 4 3 4 3 3 4 4 3 4 3 3 3 4 2 4 2 3 4 2 2 2 3 1 4
## [631] 3 4 3 4 3 4 3 3 4 4 3 4 3 4 3 3 4 3 3 4 2 4 3 4 3 3 4 3 4 2 4 2 2 4 3
## [666] 3 4 3 4 4 3 4 3 1 4 3 3 4 3 3 4 3 2 4 2 4 2 2 4 3 4 2 3 2 4 3 1 4 3 1
## [701] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 2 2 4 3 3 4 3 3 4 3 4 4 3 4 4 3
## [736] 4 3 4 4 3 4 3 4 3 3 4 3 3 4 2 1 2 4 3 4 3 4 3 4 2 4 2 4 3 4 3 4 2 4 3
## [771] 3 3 3 2 4 3 1 3 4 3 4 4 4 4 2 2 2 2 4 2 4 3 1 1 1 2 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 408271.9 455965.7 871531.3 513476.6
##  (between_SS / total_SS =  47.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
# Plot of Defense vs. Speed by cluster membership
plot(pokemon[, c("Defense", "Speed")],
     col = km.pokemon$cluster,
     main = paste("k-means clustering of Pokemon with", k, "clusters"),
     xlab = "Defense", ylab = "Speed")

  • I don’t know of a good way to plot these clusters in 6-dimensional space.
  • Here there is a lot of overlap, but if we could see a third dimension the clusters might look a little more distinct.
  • It would be good for me to learn how to visualize cluster analysis in a more compelling way. One quick option is a scatterplot matrix, sketched below.
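A minimal sketch using base R's pairs() (not from the course, just one way to see more of the structure): it shows every 2-dimensional slice of the 6 stats at once, colored by cluster.

# Scatterplot matrix of all pokemon stats, colored by k-means cluster
pairs(pokemon, col = km.pokemon$cluster, pch = 16, cex = 0.5)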

   


Hierarchical clustering


Introduction to hierarchical clustering

  • Typically used when the number of clusters is not known ahead of time
  • Two approaches: bottom-up and top-down. We will focus on bottom-up.
  • process
    • assign each point to its own cluster
    • join the two closest clusters/points into a new cluster
    • keep going until there is one cluster
    • the way you calculate the distance between clusters is a parameter and will be covered later
  • we first calculate the Euclidean distance between all points (which makes a big matrix) using the dist() function
    • this is passed into the hclust() function

– Hierarchical clustering with results

head(x)
##          [,1]     [,2]
## [1,] 3.370958 1.995379
## [2,] 1.435302 2.760242
## [3,] 2.363128 2.038991
## [4,] 2.632863 2.735072
## [5,] 2.404268 1.853527
## [6,] 1.893875 1.942113
# Create hierarchical clustering model: hclust.out
hclust.out <- hclust(dist(x))

# Inspect the result
summary(hclust.out)
##             Length Class  Mode     
## merge       598    -none- numeric  
## height      299    -none- numeric  
## order       300    -none- numeric  
## labels        0    -none- NULL     
## method        1    -none- character
## call          2    -none- call     
## dist.method   1    -none- character

Selecting number of clusters

  • you can build a dendrogram of the distances between points
  • then you either pick the number of clusters or the height (aka distance) at which to split the tree.
    • think of this as drawing a horizontal line across the dendrogram
  • the cutree() function in R lets you split the hierarchical clusters into set clusters by number or by distance (height)

– Cutting the tree

plot(hclust.out)
abline(h = 7, col = "red")

# Cut by height
cutree(hclust.out, h = 7)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 2 2 1 1 2 1 1 1 2 1 2 1 2 1 1 1 2 1
##  [36] 2 2 2 2 1 1 1 1 2 2 1 1 1 2 1 1 2 1 1 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 1
##  [71] 1 1 1 2 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1 2 2 1 1 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [176] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [211] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [246] 3 3 3 3 3 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 1 1
## [281] 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 1 2 2
# Cut by number of clusters
cutree(hclust.out, k = 3)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 2 2 1 1 2 1 1 1 2 1 2 1 2 1 1 1 2 1
##  [36] 2 2 2 2 1 1 1 1 2 2 1 1 1 2 1 1 2 1 1 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 1
##  [71] 1 1 1 2 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1 2 2 1 1 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [176] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [211] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [246] 3 3 3 3 3 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 1 1
## [281] 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 1 2 2

Clustering linkage and practical matters

  • 4 methods to measure the distance between clusters
    • complete: pairwise similarity between all observations in cluster 1 and 2, uses the largest of the similarities
    • single: same as above but uses the smallest of the similarities
    • average: same as above but uses the average of the similarities
    • centroid: finds the centroid of cluster 1 and 2, uses the similarity between the two centroids
  • rule of thumb
    • complete and average produce more balanced trees and are more commonly used
    • single fuses observations in one at a time and produces more unbalanced trees
    • centroid can create inversions, where clusters are put below single values; it is not used often
  • practical matters
    • data needs to be scaled so that features are on a comparable scale (same mean and standard deviation)
    • normalized features have a mean of zero and an sd of one

– Linkage methods

# Cluster using complete linkage: hclust.complete
hclust.complete <- hclust(dist(x), method = "complete")

# Cluster using average linkage: hclust.average
hclust.average <- hclust(dist(x), method = "average")

# Cluster using single linkage: hclust.single
hclust.single <- hclust(dist(x), method = "single")

# Plot dendrogram of hclust.complete
plot(hclust.complete, main = "Complete")

# Plot dendrogram of hclust.average
plot(hclust.average, main = "Average")

# Plot dendrogram of hclust.single
plot(hclust.single, main = "Single")

– Practical matters: scaling

# View column means
colMeans(pokemon)
##       HP   Attack  Defense  Sp. Atk  Sp. Def    Speed 
## 69.25875 79.00125 73.84250 72.82000 71.90250 68.27750
# View column standard deviations
apply(pokemon, 2, sd)
##       HP   Attack  Defense  Sp. Atk  Sp. Def    Speed 
## 25.53467 32.45737 31.18350 32.72229 27.82892 29.06047
# Scale the data
pokemon.scaled <- scale(pokemon)

# Create hierarchical clustering model: hclust.pokemon
hclust.pokemon <- hclust(dist(pokemon.scaled), method = "complete")

– Comparing kmeans() and hclust()

# Apply cutree() to hclust.pokemon: cut.pokemon
cut.pokemon <- cutree(hclust.pokemon, k = 3)

# Compare methods
table(km.pokemon$cluster, cut.pokemon)
##    cut.pokemon
##       1   2   3
##   1 114   0   0
##   2 114   0   1
##   3 277  11   0
##   4 283   0   0
  • the scaled hierarchical cluster pretty much puts all observations in cluster 1
    • not sure how I feel about that
  • the instructor says we can’t really say one of these is better.
  • this was a confusing example though. I would naturally expect more spread-out clusters, as with the k-means clusters

   


Dimensionality reduction with PCA


Introduction to PCA

  • Two main goals of dimensionality reduction
    • find structure in features
    • aid in visualization
  • PCA has 3 goals
    • find a linear combination of variables to create principal components
    • maintain as much variance in the data as possible
    • principal components are uncorrelated (i.e. orthogonal to each other)
  • intuition
    • with an x-y correlation scatter plot, the best single dimension to explain the variance in the data is a line through the point cloud (similar in spirit to, but not the same as, the regression line: PCA minimizes perpendicular rather than vertical distances)
    • this is the first principal component
    • each point’s component score is the position of its projection along that line, not its distance from the line; the line is a simple one-dimensional summary that explains most of the variation in the two-dimensional data

Example: Principal components with iris dataset

  • center and scale - for each point, subtract the column mean and divide by the column sd
  • the summary of the model shows the proportion of variance explained by each principal component
  • the rotation matrix holds the loadings: the weight of each original variable in each principal component (a sanity check on this follows the output below)
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
pr.iris <- prcomp(x = iris[-5], scale = F, center = T)
summary(pr.iris)
## Importance of components%s:
##                           PC1     PC2    PC3     PC4
## Standard deviation     2.0563 0.49262 0.2797 0.15439
## Proportion of Variance 0.9246 0.05307 0.0171 0.00521
## Cumulative Proportion  0.9246 0.97769 0.9948 1.00000
pr.iris
## Standard deviations (1, .., p=4):
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
## 
## Rotation (n x k) = (4 x 4):
##                      PC1         PC2         PC3        PC4
## Sepal.Length  0.36138659 -0.65658877  0.58202985  0.3154872
## Sepal.Width  -0.08452251 -0.73016143 -0.59791083 -0.3197231
## Petal.Length  0.85667061  0.17337266 -0.07623608 -0.4798390
## Petal.Width   0.35828920  0.07548102 -0.54583143  0.7536574
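As a sanity check on rotation and x (a sketch, assuming the centered-but-unscaled fit above): the scores prcomp() reports in pr.iris$x are just the centered data multiplied by the rotation matrix.

# Recompute the PC scores by hand: center the data, then project it
# onto the loading (rotation) vectors. Should agree with pr.iris$x.
centered <- sweep(as.matrix(iris[-5]), 2, colMeans(iris[-5]))
all.equal(centered %*% pr.iris$rotation, pr.iris$x)  # expect TRUE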

– PCA using prcomp()

  • we are just using 4 variables from the pokemon dataset here
pokemon_pr <- pokemon %>% select(HP, Attack, Defense, Speed)
glimpse(pokemon_pr)
## Observations: 800
## Variables: 4
## $ HP      <int> 45, 60, 80, 80, 39, 58, 78, 78, 78, 44, 59, 79, 79, 45...
## $ Attack  <int> 49, 62, 82, 100, 52, 64, 84, 130, 104, 48, 63, 83, 103...
## $ Defense <int> 49, 63, 83, 123, 43, 58, 78, 111, 78, 65, 80, 100, 120...
## $ Speed   <int> 45, 60, 80, 80, 65, 80, 100, 100, 100, 43, 58, 78, 78,...
summary(pokemon_pr)
##        HP             Attack       Defense           Speed       
##  Min.   :  1.00   Min.   :  5   Min.   :  5.00   Min.   :  5.00  
##  1st Qu.: 50.00   1st Qu.: 55   1st Qu.: 50.00   1st Qu.: 45.00  
##  Median : 65.00   Median : 75   Median : 70.00   Median : 65.00  
##  Mean   : 69.26   Mean   : 79   Mean   : 73.84   Mean   : 68.28  
##  3rd Qu.: 80.00   3rd Qu.:100   3rd Qu.: 90.00   3rd Qu.: 90.00  
##  Max.   :255.00   Max.   :190   Max.   :230.00   Max.   :180.00
pr.pokemon <- prcomp(x = pokemon_pr, scale = T, center = T)
summary(pr.pokemon)
## Importance of components%s:
##                           PC1    PC2    PC3    PC4
## Standard deviation     1.3721 0.9933 0.8526 0.6354
## Proportion of Variance 0.4707 0.2467 0.1817 0.1009
## Cumulative Proportion  0.4707 0.7173 0.8991 1.0000
pr.pokemon
## Standard deviations (1, .., p=4):
## [1] 1.3721424 0.9932783 0.8526020 0.6353685
## 
## Rotation (n x k) = (4 x 4):
##               PC1         PC2        PC3        PC4
## HP      0.5009303 -0.06463396  0.8300858 -0.2363236
## Attack  0.6301797  0.02703796 -0.1621455  0.7588487
## Defense 0.4556878 -0.61865282 -0.4521283 -0.4529871
## Speed   0.3798566  0.78253440 -0.2832778 -0.4038596
  • Looking at the cumulative proportion, it takes 3 components to describe at least 75% of the variance in the data

– Results of PCA

The PCA model produces additional diagnostic and output components:

  • center - the column means used to center the data
  • scale - the column standard deviations used to scale the data
  • rotation - the directions of the principal component vectors in terms of the original features/variables. Together with center and scale, this lets you project new data onto the same principal components (see the predict() sketch after the output below).
  • x - the value of each observation in the original dataset projected onto the principal components
pr.pokemon$center
##       HP   Attack  Defense    Speed 
## 69.25875 79.00125 73.84250 68.27750
pr.pokemon$scale
##       HP   Attack  Defense    Speed 
## 25.53467 32.45737 31.18350 29.06047
pr.pokemon$rotation
##               PC1         PC2        PC3        PC4
## HP      0.5009303 -0.06463396  0.8300858 -0.2363236
## Attack  0.6301797  0.02703796 -0.1621455  0.7588487
## Defense 0.4556878 -0.61865282 -0.4521283 -0.4529871
## Speed   0.3798566  0.78253440 -0.2832778 -0.4038596
head(pr.pokemon$x,10)
##              PC1          PC2         PC3          PC4
##  [1,] -1.7256845 -0.097546271 -0.05163582  0.207456766
##  [2,] -0.7783645  0.001484139  0.02184005 -0.039259320
##  [3,]  0.5559879  0.109293957  0.08715421 -0.325236471
##  [4,]  1.4899932 -0.669275804 -0.58272583 -0.485458941
##  [5,] -1.6113972  0.577730653 -0.36963560  0.142341152
##  [6,] -0.5904092  0.645964030 -0.17563029 -0.179301322
##  [7,]  0.7439432  0.753773847 -0.11031614 -0.465278473
##  [8,]  2.1192939  0.137402684 -0.81858156  0.130820647
##  [9,]  1.1322555  0.770434451 -0.21022907  0.002318739
## [10,] -1.5570505 -0.467129383 -0.29163598 -0.011297649
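To make the "new data" point concrete, here is a small sketch using predict.prcomp(), which applies the stored center, scale, and rotation to unseen observations (the new row is made up for illustration):

# Score a hypothetical new pokemon on the existing principal components
new_mon <- data.frame(HP = 100, Attack = 100, Defense = 100, Speed = 100)
predict(pr.pokemon, newdata = new_mon)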

Visualizing and interpreting PCA results

  • biplot
    • shows all of the original observations as points plotted by the first 2 principal components
    • it also shows the original features as vectors mapped onto the first 2 principal components
    • with the iris data, notice how petal width and length point in the same direction
    • this means the variables are correlated: one explains about all the variance of the other
    • I found a nicer biplot function, PCbiplot(), on Stack Overflow, made with ggplot2, and modified it to work a little better. I use it in most cases because it is much clearer (a rough sketch of the idea is below).
  • scree plot
    • either shows the proportion of variance explained by each principal component or the cumulative variance explained by successive components
    • all of the variance is explained when the number of components matches the number of original features/variables
    • a couple of steps are required to get the variance from the sd of each component for the plot
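PCbiplot() itself is not defined in these notes (it is presumably loaded by create_datasets.R). A rough ggplot2 sketch of the same idea, my own reconstruction rather than the Stack Overflow original:

# Rough sketch of a ggplot2 biplot: observations as points scored on the
# first two PCs, variables as loading arrows (arrow scale is arbitrary)
PCbiplot_sketch <- function(pr, arrow_scale = 3) {
  scores   <- as.data.frame(pr$x[, 1:2])
  loadings <- as.data.frame(pr$rotation[, 1:2])
  loadings$var <- rownames(loadings)
  ggplot(scores, aes(PC1, PC2)) +
    geom_point(alpha = 0.4) +
    geom_segment(data = loadings, color = "red",
                 aes(x = 0, y = 0, xend = arrow_scale * PC1, yend = arrow_scale * PC2),
                 arrow = arrow(length = grid::unit(2, "mm"))) +
    geom_text(data = loadings, color = "red", vjust = -0.5,
              aes(x = arrow_scale * PC1, y = arrow_scale * PC2, label = var))
}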
# creating a biplot
# this does not look as nice as the one he had in the video
biplot(pr.iris)

PCbiplot(pr.iris)

# Getting proportion of variance for a scree plot
pr.var <- pr.iris$sdev^2
pve <- pr.var / sum(pr.var)

# Plot variance explained for each principal component
plot(pve, 
     xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0,1), 
     type = "b")

– Interpreting biplots (1)

  • which two original variables have approximately the same loadings in the first two principal components?
    • Attack and Hit Points (HP)
  • which two pokemon are the least similar in terms of the second principal component?
    • 430 and 231 (highest and lowest points)
    • I can use the ids to look up the name in the original dataset
PCbiplot(pr.pokemon)

– Variance explained

# Variability of each principal component: pr.var
pr.var <- pr.pokemon$sdev^2

# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
pve
## [1] 0.4706937 0.2466505 0.1817326 0.1009233

– Visualize variance explained

# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

Practical issues with PCA

3 things need to be considered for a successful PCA:

  • scaling the data
  • missing data
    • either drop it or impute it
  • features that are categories
    • drop them or encode them as numbers
    • this is not covered in this class (much more pre-processing is needed in real-world applications than we are doing here, I believe); a toy sketch of both fixes is below
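A minimal toy sketch of those two fixes (the data frame here is made up, not from the course):

toy <- data.frame(hp = c(45, NA, 80), type = c("Grass", "Fire", "Grass"))

# Impute a missing numeric value with the column mean
toy$hp[is.na(toy$hp)] <- mean(toy$hp, na.rm = TRUE)

# Either drop the categorical column or encode it as numbers
toy_numeric <- data.frame(hp = toy$hp, type_code = as.numeric(factor(toy$type)))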

Importance of scaling:

  • the mtcars dataset has very different means and sds across variables
  • look at the results of the biplots with and without scaling
  • without scaling, disp and hp overwhelm the other features simply because they have much higher variance in their units of measure
data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
round(colMeans(mtcars), 2)
##    mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear 
##  20.09   6.19 230.72 146.69   3.60   3.22  17.85   0.44   0.41   3.69 
##   carb 
##   2.81
round(apply(mtcars, 2, sd), 2)
##    mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear 
##   6.03   1.79 123.94  68.56   0.53   0.98   1.79   0.50   0.50   0.74 
##   carb 
##   1.62
pr.mtcars_no_scale <- prcomp(x = mtcars, scale = F, center = F)
pr.mtcars_scale <- prcomp(x = mtcars, scale = T, center = T)

PCbiplot(pr.mtcars_no_scale)

PCbiplot(pr.mtcars_scale)

– Practical issues: scaling

# Mean of each variable
colMeans(pokemon_pr)
##       HP   Attack  Defense    Speed 
## 69.25875 79.00125 73.84250 68.27750
# Standard deviation of each variable
apply(pokemon_pr, 2, sd)
##       HP   Attack  Defense    Speed 
## 25.53467 32.45737 31.18350 29.06047
# PCA model with scaling: pr.with.scaling
pr.with.scaling <- prcomp(pokemon_pr, scale = T, center = T)

# PCA model without scaling: pr.without.scaling
pr.without.scaling <- prcomp(pokemon_pr, scale = F, center = F)

# Create biplots of both for comparison
PCbiplot(pr.without.scaling)

PCbiplot(pr.with.scaling)

   


Putting it all together with a case study


Introduction to the case study

  • Data from a paper by Bennett and Mangasarian
    • “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”
  • Human breast masses that were or were not malignant
    • 10 features measured for the cell nuclei of each mass (the data has 30 feature columns: a mean, standard error, and worst value for each of the 10)
    • each feature is a summary statistic of the cells in that mass
    • includes the diagnosis (target) - can be used for supervised learning but will not be used during the unsupervised analysis
  • overall steps
    • download and prepare the data
    • EDA
    • perform PCA and interpret the results
    • complete two types of clustering
    • understand and compare the two types
    • combine PCA and clustering

– Preparing the data

url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1903/datasets/WisconsinCancer.csv"

# Download the data: wisc.df
wisc.df <- read.csv(url)
str(wisc.df)
## 'data.frame':    569 obs. of  33 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
##  $ X                      : logi  NA NA NA NA NA NA ...
# Convert the features of the data: wisc.data
wisc.data <- as.matrix(wisc.df[, 3:32])
str(wisc.data)
##  num [1:569, 1:30] 18 20.6 19.7 11.4 20.3 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:30] "radius_mean" "texture_mean" "perimeter_mean" "area_mean" ...
head(wisc.data)
##      radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## [1,]       17.99        10.38         122.80    1001.0         0.11840
## [2,]       20.57        17.77         132.90    1326.0         0.08474
## [3,]       19.69        21.25         130.00    1203.0         0.10960
## [4,]       11.42        20.38          77.58     386.1         0.14250
## [5,]       20.29        14.34         135.10    1297.0         0.10030
## [6,]       12.45        15.70          82.57     477.1         0.12780
##      compactness_mean concavity_mean concave.points_mean symmetry_mean
## [1,]          0.27760         0.3001             0.14710        0.2419
## [2,]          0.07864         0.0869             0.07017        0.1812
## [3,]          0.15990         0.1974             0.12790        0.2069
## [4,]          0.28390         0.2414             0.10520        0.2597
## [5,]          0.13280         0.1980             0.10430        0.1809
## [6,]          0.17000         0.1578             0.08089        0.2087
##      fractal_dimension_mean radius_se texture_se perimeter_se area_se
## [1,]                0.07871    1.0950     0.9053        8.589  153.40
## [2,]                0.05667    0.5435     0.7339        3.398   74.08
## [3,]                0.05999    0.7456     0.7869        4.585   94.03
## [4,]                0.09744    0.4956     1.1560        3.445   27.23
## [5,]                0.05883    0.7572     0.7813        5.438   94.44
## [6,]                0.07613    0.3345     0.8902        2.217   27.19
##      smoothness_se compactness_se concavity_se concave.points_se
## [1,]      0.006399        0.04904      0.05373           0.01587
## [2,]      0.005225        0.01308      0.01860           0.01340
## [3,]      0.006150        0.04006      0.03832           0.02058
## [4,]      0.009110        0.07458      0.05661           0.01867
## [5,]      0.011490        0.02461      0.05688           0.01885
## [6,]      0.007510        0.03345      0.03672           0.01137
##      symmetry_se fractal_dimension_se radius_worst texture_worst
## [1,]     0.03003             0.006193        25.38         17.33
## [2,]     0.01389             0.003532        24.99         23.41
## [3,]     0.02250             0.004571        23.57         25.53
## [4,]     0.05963             0.009208        14.91         26.50
## [5,]     0.01756             0.005115        22.54         16.67
## [6,]     0.02165             0.005082        15.47         23.75
##      perimeter_worst area_worst smoothness_worst compactness_worst
## [1,]          184.60     2019.0           0.1622            0.6656
## [2,]          158.80     1956.0           0.1238            0.1866
## [3,]          152.50     1709.0           0.1444            0.4245
## [4,]           98.87      567.7           0.2098            0.8663
## [5,]          152.20     1575.0           0.1374            0.2050
## [6,]          103.40      741.6           0.1791            0.5249
##      concavity_worst concave.points_worst symmetry_worst
## [1,]          0.7119               0.2654         0.4601
## [2,]          0.2416               0.1860         0.2750
## [3,]          0.4504               0.2430         0.3613
## [4,]          0.6869               0.2575         0.6638
## [5,]          0.4000               0.1625         0.2364
## [6,]          0.5355               0.1741         0.3985
##      fractal_dimension_worst
## [1,]                 0.11890
## [2,]                 0.08902
## [3,]                 0.08758
## [4,]                 0.17300
## [5,]                 0.07678
## [6,]                 0.12440
# Set the row names of wisc.data
row.names(wisc.data) <- wisc.df$id
head(wisc.data)
##          radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 842302         17.99        10.38         122.80    1001.0         0.11840
## 842517         20.57        17.77         132.90    1326.0         0.08474
## 84300903       19.69        21.25         130.00    1203.0         0.10960
## 84348301       11.42        20.38          77.58     386.1         0.14250
## 84358402       20.29        14.34         135.10    1297.0         0.10030
## 843786         12.45        15.70          82.57     477.1         0.12780
##          compactness_mean concavity_mean concave.points_mean symmetry_mean
## 842302            0.27760         0.3001             0.14710        0.2419
## 842517            0.07864         0.0869             0.07017        0.1812
## 84300903          0.15990         0.1974             0.12790        0.2069
## 84348301          0.28390         0.2414             0.10520        0.2597
## 84358402          0.13280         0.1980             0.10430        0.1809
## 843786            0.17000         0.1578             0.08089        0.2087
##          fractal_dimension_mean radius_se texture_se perimeter_se area_se
## 842302                  0.07871    1.0950     0.9053        8.589  153.40
## 842517                  0.05667    0.5435     0.7339        3.398   74.08
## 84300903                0.05999    0.7456     0.7869        4.585   94.03
## 84348301                0.09744    0.4956     1.1560        3.445   27.23
## 84358402                0.05883    0.7572     0.7813        5.438   94.44
## 843786                  0.07613    0.3345     0.8902        2.217   27.19
##          smoothness_se compactness_se concavity_se concave.points_se
## 842302        0.006399        0.04904      0.05373           0.01587
## 842517        0.005225        0.01308      0.01860           0.01340
## 84300903      0.006150        0.04006      0.03832           0.02058
## 84348301      0.009110        0.07458      0.05661           0.01867
## 84358402      0.011490        0.02461      0.05688           0.01885
## 843786        0.007510        0.03345      0.03672           0.01137
##          symmetry_se fractal_dimension_se radius_worst texture_worst
## 842302       0.03003             0.006193        25.38         17.33
## 842517       0.01389             0.003532        24.99         23.41
## 84300903     0.02250             0.004571        23.57         25.53
## 84348301     0.05963             0.009208        14.91         26.50
## 84358402     0.01756             0.005115        22.54         16.67
## 843786       0.02165             0.005082        15.47         23.75
##          perimeter_worst area_worst smoothness_worst compactness_worst
## 842302            184.60     2019.0           0.1622            0.6656
## 842517            158.80     1956.0           0.1238            0.1866
## 84300903          152.50     1709.0           0.1444            0.4245
## 84348301           98.87      567.7           0.2098            0.8663
## 84358402          152.20     1575.0           0.1374            0.2050
## 843786            103.40      741.6           0.1791            0.5249
##          concavity_worst concave.points_worst symmetry_worst
## 842302            0.7119               0.2654         0.4601
## 842517            0.2416               0.1860         0.2750
## 84300903          0.4504               0.2430         0.3613
## 84348301          0.6869               0.2575         0.6638
## 84358402          0.4000               0.1625         0.2364
## 843786            0.5355               0.1741         0.3985
##          fractal_dimension_worst
## 842302                   0.11890
## 842517                   0.08902
## 84300903                 0.08758
## 84348301                 0.17300
## 84358402                 0.07678
## 843786                   0.12440
# Create diagnosis vector
diagnosis <- as.numeric(wisc.df$diagnosis == "M")

– Exploratory data analysis

  • How many observations are in this dataset?
    • 569
  • How many variables/features in the data are suffixed with _mean?
    • 10
  • How many of the observations have a malignant diagnosis?
    • 212
str(wisc.data)
##  num [1:569, 1:30] 18 20.6 19.7 11.4 20.3 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:569] "842302" "842517" "84300903" "84348301" ...
##   ..$ : chr [1:30] "radius_mean" "texture_mean" "perimeter_mean" "area_mean" ...
colnames(wisc.data)
##  [1] "radius_mean"             "texture_mean"           
##  [3] "perimeter_mean"          "area_mean"              
##  [5] "smoothness_mean"         "compactness_mean"       
##  [7] "concavity_mean"          "concave.points_mean"    
##  [9] "symmetry_mean"           "fractal_dimension_mean" 
## [11] "radius_se"               "texture_se"             
## [13] "perimeter_se"            "area_se"                
## [15] "smoothness_se"           "compactness_se"         
## [17] "concavity_se"            "concave.points_se"      
## [19] "symmetry_se"             "fractal_dimension_se"   
## [21] "radius_worst"            "texture_worst"          
## [23] "perimeter_worst"         "area_worst"             
## [25] "smoothness_worst"        "compactness_worst"      
## [27] "concavity_worst"         "concave.points_worst"   
## [29] "symmetry_worst"          "fractal_dimension_worst"
str_match(colnames(wisc.data), "_mean")
##       [,1]   
##  [1,] "_mean"
##  [2,] "_mean"
##  [3,] "_mean"
##  [4,] "_mean"
##  [5,] "_mean"
##  [6,] "_mean"
##  [7,] "_mean"
##  [8,] "_mean"
##  [9,] "_mean"
## [10,] "_mean"
## [11,] NA     
## [12,] NA     
## [13,] NA     
## [14,] NA     
## [15,] NA     
## [16,] NA     
## [17,] NA     
## [18,] NA     
## [19,] NA     
## [20,] NA     
## [21,] NA     
## [22,] NA     
## [23,] NA     
## [24,] NA     
## [25,] NA     
## [26,] NA     
## [27,] NA     
## [28,] NA     
## [29,] NA     
## [30,] NA
table(diagnosis)
## diagnosis
##   0   1 
## 357 212

– Performing PCA

# Check column means and standard deviations
round(colMeans(wisc.data), 2)
##             radius_mean            texture_mean          perimeter_mean 
##                   14.13                   19.29                   91.97 
##               area_mean         smoothness_mean        compactness_mean 
##                  654.89                    0.10                    0.10 
##          concavity_mean     concave.points_mean           symmetry_mean 
##                    0.09                    0.05                    0.18 
##  fractal_dimension_mean               radius_se              texture_se 
##                    0.06                    0.41                    1.22 
##            perimeter_se                 area_se           smoothness_se 
##                    2.87                   40.34                    0.01 
##          compactness_se            concavity_se       concave.points_se 
##                    0.03                    0.03                    0.01 
##             symmetry_se    fractal_dimension_se            radius_worst 
##                    0.02                    0.00                   16.27 
##           texture_worst         perimeter_worst              area_worst 
##                   25.68                  107.26                  880.58 
##        smoothness_worst       compactness_worst         concavity_worst 
##                    0.13                    0.25                    0.27 
##    concave.points_worst          symmetry_worst fractal_dimension_worst 
##                    0.11                    0.29                    0.08
round(apply(wisc.data, 2, sd), 2)
##             radius_mean            texture_mean          perimeter_mean 
##                    3.52                    4.30                   24.30 
##               area_mean         smoothness_mean        compactness_mean 
##                  351.91                    0.01                    0.05 
##          concavity_mean     concave.points_mean           symmetry_mean 
##                    0.08                    0.04                    0.03 
##  fractal_dimension_mean               radius_se              texture_se 
##                    0.01                    0.28                    0.55 
##            perimeter_se                 area_se           smoothness_se 
##                    2.02                   45.49                    0.00 
##          compactness_se            concavity_se       concave.points_se 
##                    0.02                    0.03                    0.01 
##             symmetry_se    fractal_dimension_se            radius_worst 
##                    0.01                    0.00                    4.83 
##           texture_worst         perimeter_worst              area_worst 
##                    6.15                   33.60                  569.36 
##        smoothness_worst       compactness_worst         concavity_worst 
##                    0.02                    0.16                    0.21 
##    concave.points_worst          symmetry_worst fractal_dimension_worst 
##                    0.07                    0.06                    0.02
# Execute PCA, scaling if appropriate: wisc.pr
wisc.pr <- prcomp(wisc.data, scale = T, center = T)

# Look at summary of results
summary(wisc.pr)
## Importance of components%s:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     3.6444 2.3857 1.67867 1.40735 1.28403 1.09880
## Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025
## Cumulative Proportion  0.4427 0.6324 0.72636 0.79239 0.84734 0.88759
##                            PC7     PC8    PC9    PC10   PC11    PC12
## Standard deviation     0.82172 0.69037 0.6457 0.59219 0.5421 0.51104
## Proportion of Variance 0.02251 0.01589 0.0139 0.01169 0.0098 0.00871
## Cumulative Proportion  0.91010 0.92598 0.9399 0.95157 0.9614 0.97007
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.49128 0.39624 0.30681 0.28260 0.24372 0.22939
## Proportion of Variance 0.00805 0.00523 0.00314 0.00266 0.00198 0.00175
## Cumulative Proportion  0.97812 0.98335 0.98649 0.98915 0.99113 0.99288
##                           PC19    PC20   PC21    PC22    PC23   PC24
## Standard deviation     0.22244 0.17652 0.1731 0.16565 0.15602 0.1344
## Proportion of Variance 0.00165 0.00104 0.0010 0.00091 0.00081 0.0006
## Cumulative Proportion  0.99453 0.99557 0.9966 0.99749 0.99830 0.9989
##                           PC25    PC26    PC27    PC28    PC29    PC30
## Standard deviation     0.12442 0.09043 0.08307 0.03987 0.02736 0.01153
## Proportion of Variance 0.00052 0.00027 0.00023 0.00005 0.00002 0.00000
## Cumulative Proportion  0.99942 0.99969 0.99992 0.99997 1.00000 1.00000

– Interpreting PCA results

# Create a biplot of wisc.pr
PCbiplot(wisc.pr)

# Scatter plot observations by components 1 and 2
plot(wisc.pr$x[, c(1, 2)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC2")

# Repeat for components 1 and 3
plot(wisc.pr$x[, c(1, 3)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC3")

# Do additional data exploration of your choosing below (optional)
plot(wisc.pr$x[, c(2, 3)], col = (diagnosis + 1), 
     xlab = "PC2", ylab = "PC3")

  • We can see from the charts that the diagnoses overlap less in the PC1 vs. PC2 plot than in the PC1 vs. PC3 plot.
    • This is expected: the components are all orthogonal, and PC2 explains more of the remaining variance than PC3.
  • The classes overlap more in the PC2 vs. PC3 plot than in either plot involving PC1.

– Variance explained

# Set up 1 x 2 plotting grid
par(mfrow = c(1, 2))

# Calculate variability of each component
pr.var <- wisc.pr$sdev^2

# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)

# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component", 
     ylab = "Cumulative Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

– Communicating PCA results

  • For the first principal component, what is the component of the loading vector for the feature concave.points_mean?
    • -0.26085376
  • What is the minimum number of principal components required to explain 80% of the variance of the data?
    • 5
wisc.pr$rotation[1:10,1:2]
##                                PC1         PC2
## radius_mean            -0.21890244  0.23385713
## texture_mean           -0.10372458  0.05970609
## perimeter_mean         -0.22753729  0.21518136
## area_mean              -0.22099499  0.23107671
## smoothness_mean        -0.14258969 -0.18611302
## compactness_mean       -0.23928535 -0.15189161
## concavity_mean         -0.25840048 -0.06016536
## concave.points_mean    -0.26085376  0.03476750
## symmetry_mean          -0.13816696 -0.19034877
## fractal_dimension_mean -0.06436335 -0.36657547
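A quick way to confirm the second answer from pve (computed in the previous section), rather than reading it off summary(wisc.pr):

# First component count at which cumulative variance explained reaches 80%
which(cumsum(pve) >= 0.8)[1]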

PCA review and next steps

– Hierarchical clustering of case data

# Scale the wisc.data data: data.scaled
head(wisc.data)
##          radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 842302         17.99        10.38         122.80    1001.0         0.11840
## 842517         20.57        17.77         132.90    1326.0         0.08474
## 84300903       19.69        21.25         130.00    1203.0         0.10960
## 84348301       11.42        20.38          77.58     386.1         0.14250
## 84358402       20.29        14.34         135.10    1297.0         0.10030
## 843786         12.45        15.70          82.57     477.1         0.12780
##          compactness_mean concavity_mean concave.points_mean symmetry_mean
## 842302            0.27760         0.3001             0.14710        0.2419
## 842517            0.07864         0.0869             0.07017        0.1812
## 84300903          0.15990         0.1974             0.12790        0.2069
## 84348301          0.28390         0.2414             0.10520        0.2597
## 84358402          0.13280         0.1980             0.10430        0.1809
## 843786            0.17000         0.1578             0.08089        0.2087
##          fractal_dimension_mean radius_se texture_se perimeter_se area_se
## 842302                  0.07871    1.0950     0.9053        8.589  153.40
## 842517                  0.05667    0.5435     0.7339        3.398   74.08
## 84300903                0.05999    0.7456     0.7869        4.585   94.03
## 84348301                0.09744    0.4956     1.1560        3.445   27.23
## 84358402                0.05883    0.7572     0.7813        5.438   94.44
## 843786                  0.07613    0.3345     0.8902        2.217   27.19
##          smoothness_se compactness_se concavity_se concave.points_se
## 842302        0.006399        0.04904      0.05373           0.01587
## 842517        0.005225        0.01308      0.01860           0.01340
## 84300903      0.006150        0.04006      0.03832           0.02058
## 84348301      0.009110        0.07458      0.05661           0.01867
## 84358402      0.011490        0.02461      0.05688           0.01885
## 843786        0.007510        0.03345      0.03672           0.01137
##          symmetry_se fractal_dimension_se radius_worst texture_worst
## 842302       0.03003             0.006193        25.38         17.33
## 842517       0.01389             0.003532        24.99         23.41
## 84300903     0.02250             0.004571        23.57         25.53
## 84348301     0.05963             0.009208        14.91         26.50
## 84358402     0.01756             0.005115        22.54         16.67
## 843786       0.02165             0.005082        15.47         23.75
##          perimeter_worst area_worst smoothness_worst compactness_worst
## 842302            184.60     2019.0           0.1622            0.6656
## 842517            158.80     1956.0           0.1238            0.1866
## 84300903          152.50     1709.0           0.1444            0.4245
## 84348301           98.87      567.7           0.2098            0.8663
## 84358402          152.20     1575.0           0.1374            0.2050
## 843786            103.40      741.6           0.1791            0.5249
##          concavity_worst concave.points_worst symmetry_worst
## 842302            0.7119               0.2654         0.4601
## 842517            0.2416               0.1860         0.2750
## 84300903          0.4504               0.2430         0.3613
## 84348301          0.6869               0.2575         0.6638
## 84358402          0.4000               0.1625         0.2364
## 843786            0.5355               0.1741         0.3985
##          fractal_dimension_worst
## 842302                   0.11890
## 842517                   0.08902
## 84300903                 0.08758
## 84348301                 0.17300
## 84358402                 0.07678
## 843786                   0.12440
data.scaled <- scale(wisc.data)
head(data.scaled)
##          radius_mean texture_mean perimeter_mean  area_mean
## 842302     1.0960995   -2.0715123      1.2688173  0.9835095
## 842517     1.8282120   -0.3533215      1.6844726  1.9070303
## 84300903   1.5784992    0.4557859      1.5651260  1.5575132
## 84348301  -0.7682333    0.2535091     -0.5921661 -0.7637917
## 84358402   1.7487579   -1.1508038      1.7750113  1.8246238
## 843786    -0.4759559   -0.8346009     -0.3868077 -0.5052059
##          smoothness_mean compactness_mean concavity_mean
## 842302         1.5670875        3.2806281     2.65054179
## 842517        -0.8262354       -0.4866435    -0.02382489
## 84300903       0.9413821        1.0519999     1.36227979
## 84348301       3.2806668        3.3999174     1.91421287
## 84358402       0.2801253        0.5388663     1.36980615
## 843786         2.2354545        1.2432416     0.86554001
##          concave.points_mean symmetry_mean fractal_dimension_mean
## 842302             2.5302489   2.215565542              2.2537638
## 842517             0.5476623   0.001391139             -0.8678888
## 84300903           2.0354398   0.938858720             -0.3976580
## 84348301           1.4504311   2.864862154              4.9066020
## 84358402           1.4272370  -0.009552062             -0.5619555
## 843786             0.8239307   1.004517928              1.8883435
##           radius_se texture_se perimeter_se    area_se smoothness_se
## 842302    2.4875451 -0.5647681    2.8305403  2.4853907    -0.2138135
## 842517    0.4988157 -0.8754733    0.2630955  0.7417493    -0.6048187
## 84300903  1.2275958 -0.7793976    0.8501802  1.1802975    -0.2967439
## 84348301  0.3260865 -0.1103120    0.2863415 -0.2881246     0.6890953
## 84358402  1.2694258 -0.7895490    1.2720701  1.1893103     1.4817634
## 843786   -0.2548461 -0.5921406   -0.3210217 -0.2890039     0.1562093
##          compactness_se concavity_se concave.points_se symmetry_se
## 842302       1.31570389    0.7233897        0.66023900   1.1477468
## 842517      -0.69231710   -0.4403926        0.25993335  -0.8047423
## 84300903     0.81425704    0.2128891        1.42357487   0.2368272
## 84348301     2.74186785    0.8187979        1.11402678   4.7285198
## 84358402    -0.04847723    0.8277425        1.14319885  -0.3607748
## 843786       0.44515196    0.1598845       -0.06906279   0.1340009
##          fractal_dimension_se radius_worst texture_worst perimeter_worst
## 842302             0.90628565    1.8850310   -1.35809849       2.3015755
## 842517            -0.09935632    1.8043398   -0.36887865       1.5337764
## 84300903           0.29330133    1.5105411   -0.02395331       1.3462906
## 84348301           2.04571087   -0.2812170    0.13386631      -0.2497196
## 84358402           0.49888916    1.2974336   -1.46548091       1.3373627
## 843786             0.48641784   -0.1653528   -0.31356043      -0.1149083
##          area_worst smoothness_worst compactness_worst concavity_worst
## 842302    1.9994782        1.3065367         2.6143647       2.1076718
## 842517    1.8888270       -0.3752817        -0.4300658      -0.1466200
## 84300903  1.4550043        0.5269438         1.0819801       0.8542223
## 84348301 -0.5495377        3.3912907         3.8899747       1.9878392
## 84358402  1.2196511        0.2203623        -0.3131190       0.6126397
## 843786   -0.2441054        2.0467119         1.7201029       1.2621327
##          concave.points_worst symmetry_worst fractal_dimension_worst
## 842302              2.2940576      2.7482041               1.9353117
## 842517              1.0861286     -0.2436753               0.2809428
## 84300903            1.9532817      1.1512420               0.2012142
## 84348301            2.1738732      6.0407261               4.9306719
## 84358402            0.7286181     -0.8675896              -0.3967505
## 843786              0.9050914      1.7525273               2.2398308
# Calculate the (Euclidean) distances: data.dist
data.dist <- dist(data.scaled)

# Create a hierarchical clustering model: wisc.hclust
wisc.hclust <- hclust(data.dist, method = "complete")
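As a side note (my own, not a course exercise): dist() defaults to Euclidean distance, and other linkage methods can be swapped in the same way. A minimal sketch:

# Ward linkage tends to produce more balanced, compact clusters than
# complete linkage (an alternative worth trying; not used in the course)
wisc.hclust.ward <- hclust(data.dist, method = "ward.D2")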

– Results of hierarchical clustering

  • Cutting the tree at a height of about 20 gives 4 clusters.
plot(wisc.hclust)

  • I can kinda see why we could cut at 4 clusters: that gives us the two main clusters, plus a couple of tiny ones on the left.
  • It would be cool if we could color the dendrogram branches by diagnosis to make the split point easier to see; a sketch of one way to get close to that follows.
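Coloring branches by the true diagnosis isn't built into hclust, but the dendextend package (an assumption on my part; it isn't used in the course) can at least color branches by the k = 4 cut, which makes the split point easier to judge. A minimal sketch:

# Sketch using dendextend (not part of the original exercise) to color
# the dendrogram branches by the 4-cluster cut
library(dendextend)

dend <- as.dendrogram(wisc.hclust)
dend <- color_branches(dend, k = 4)  # one color per cluster
plot(dend)
abline(h = 20, lty = 2)              # roughly where the 4-cluster cut lands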

– Selecting number of clusters

# Cut tree so that it has 4 clusters: wisc.hclust.clusters
wisc.hclust.clusters <- cutree(wisc.hclust, k = 4)

# Compare cluster membership to actual diagnoses
table(wisc.hclust.clusters, diagnosis)
##                     diagnosis
## wisc.hclust.clusters   0   1
##                    1  12 165
##                    2   2   5
##                    3 343  40
##                    4   0   2
# Count out-of-place observations: sum the row minimums of the
# cluster-by-diagnosis table
sum(apply(table(wisc.hclust.clusters, diagnosis), 1, min))
## [1] 54
  • Looks like 54 tumors land in a cluster where their diagnosis is in the minority, so the clusters don't line up cleanly with the diagnosis. Since this row-min count gets repeated for every model below, a small helper is sketched next.
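This same "sum of row minimums" count gets computed for every model in this section, so it could be wrapped in a small helper. A minimal sketch (the function name is mine, not from the course):

# Hypothetical helper: count observations whose diagnosis is in the
# minority within their assigned cluster
count_mismatched <- function(tab) sum(apply(tab, 1, min))

count_mismatched(table(wisc.hclust.clusters, diagnosis))  # should return 54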

– k-means clustering and comparing results

# Create a k-means model on wisc.data: wisc.km
head(wisc.data)
##          radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 842302         17.99        10.38         122.80    1001.0         0.11840
## 842517         20.57        17.77         132.90    1326.0         0.08474
## 84300903       19.69        21.25         130.00    1203.0         0.10960
## 84348301       11.42        20.38          77.58     386.1         0.14250
## 84358402       20.29        14.34         135.10    1297.0         0.10030
## 843786         12.45        15.70          82.57     477.1         0.12780
##          compactness_mean concavity_mean concave.points_mean symmetry_mean
## 842302            0.27760         0.3001             0.14710        0.2419
## 842517            0.07864         0.0869             0.07017        0.1812
## 84300903          0.15990         0.1974             0.12790        0.2069
## 84348301          0.28390         0.2414             0.10520        0.2597
## 84358402          0.13280         0.1980             0.10430        0.1809
## 843786            0.17000         0.1578             0.08089        0.2087
##          fractal_dimension_mean radius_se texture_se perimeter_se area_se
## 842302                  0.07871    1.0950     0.9053        8.589  153.40
## 842517                  0.05667    0.5435     0.7339        3.398   74.08
## 84300903                0.05999    0.7456     0.7869        4.585   94.03
## 84348301                0.09744    0.4956     1.1560        3.445   27.23
## 84358402                0.05883    0.7572     0.7813        5.438   94.44
## 843786                  0.07613    0.3345     0.8902        2.217   27.19
##          smoothness_se compactness_se concavity_se concave.points_se
## 842302        0.006399        0.04904      0.05373           0.01587
## 842517        0.005225        0.01308      0.01860           0.01340
## 84300903      0.006150        0.04006      0.03832           0.02058
## 84348301      0.009110        0.07458      0.05661           0.01867
## 84358402      0.011490        0.02461      0.05688           0.01885
## 843786        0.007510        0.03345      0.03672           0.01137
##          symmetry_se fractal_dimension_se radius_worst texture_worst
## 842302       0.03003             0.006193        25.38         17.33
## 842517       0.01389             0.003532        24.99         23.41
## 84300903     0.02250             0.004571        23.57         25.53
## 84348301     0.05963             0.009208        14.91         26.50
## 84358402     0.01756             0.005115        22.54         16.67
## 843786       0.02165             0.005082        15.47         23.75
##          perimeter_worst area_worst smoothness_worst compactness_worst
## 842302            184.60     2019.0           0.1622            0.6656
## 842517            158.80     1956.0           0.1238            0.1866
## 84300903          152.50     1709.0           0.1444            0.4245
## 84348301           98.87      567.7           0.2098            0.8663
## 84358402          152.20     1575.0           0.1374            0.2050
## 843786            103.40      741.6           0.1791            0.5249
##          concavity_worst concave.points_worst symmetry_worst
## 842302            0.7119               0.2654         0.4601
## 842517            0.2416               0.1860         0.2750
## 84300903          0.4504               0.2430         0.3613
## 84348301          0.6869               0.2575         0.6638
## 84358402          0.4000               0.1625         0.2364
## 843786            0.5355               0.1741         0.3985
##          fractal_dimension_worst
## 842302                   0.11890
## 842517                   0.08902
## 84300903                 0.08758
## 84348301                 0.17300
## 84358402                 0.07678
## 843786                   0.12440
wisc.km <- kmeans(scale(wisc.data), centers = 2, nstart = 20)

# Compare k-means to actual diagnoses
table(wisc.km$cluster, diagnosis)
##    diagnosis
##       0   1
##   1 343  37
##   2  14 175
sum(apply(table(wisc.km$cluster, diagnosis), 1, min))
## [1] 51
# Compare k-means to hierarchical clustering
table(wisc.hclust.clusters, wisc.km$cluster)
##                     
## wisc.hclust.clusters   1   2
##                    1  17 160
##                    2   0   7
##                    3 363  20
##                    4   0   2
sum(apply(table(wisc.hclust.clusters, wisc.km$cluster), 1, min))
## [1] 37

– Clustering on PCA results

  • Recall from earlier exercises that the PCA model required significantly fewer features to describe 80% and 95% of the variability of the data.
  • In addition to normalizing data and potentially avoiding overfitting, PCA also decorrelates the variables, sometimes improving the performance of other modeling techniques. A quick check of how many PCs we actually need is sketched below.
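The "significantly fewer features" claim can be verified directly from the standard deviations stored in the PCA model; a minimal sketch:

# Proportion of variance explained by each PC
pve <- wisc.pr$sdev^2 / sum(wisc.pr$sdev^2)

# First PC at which the cumulative proportion crosses 80% and 95%
which(cumsum(pve) >= 0.80)[1]  # 5, per the summary output below
which(cumsum(pve) >= 0.95)[1]  # 10, per the summary output below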
# Create a hierarchical clustering model: wisc.pr.hclust
summary(wisc.pr)
## Importance of components%s:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     3.6444 2.3857 1.67867 1.40735 1.28403 1.09880
## Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025
## Cumulative Proportion  0.4427 0.6324 0.72636 0.79239 0.84734 0.88759
##                            PC7     PC8    PC9    PC10   PC11    PC12
## Standard deviation     0.82172 0.69037 0.6457 0.59219 0.5421 0.51104
## Proportion of Variance 0.02251 0.01589 0.0139 0.01169 0.0098 0.00871
## Cumulative Proportion  0.91010 0.92598 0.9399 0.95157 0.9614 0.97007
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.49128 0.39624 0.30681 0.28260 0.24372 0.22939
## Proportion of Variance 0.00805 0.00523 0.00314 0.00266 0.00198 0.00175
## Cumulative Proportion  0.97812 0.98335 0.98649 0.98915 0.99113 0.99288
##                           PC19    PC20   PC21    PC22    PC23   PC24
## Standard deviation     0.22244 0.17652 0.1731 0.16565 0.15602 0.1344
## Proportion of Variance 0.00165 0.00104 0.0010 0.00091 0.00081 0.0006
## Cumulative Proportion  0.99453 0.99557 0.9966 0.99749 0.99830 0.9989
##                           PC25    PC26    PC27    PC28    PC29    PC30
## Standard deviation     0.12442 0.09043 0.08307 0.03987 0.02736 0.01153
## Proportion of Variance 0.00052 0.00027 0.00023 0.00005 0.00002 0.00000
## Cumulative Proportion  0.99942 0.99969 0.99992 0.99997 1.00000 1.00000
wisc.pr.hclust <- hclust(dist(wisc.pr$x[, 1:7]), method = "complete")

# Cut model into 4 clusters: wisc.pr.hclust.clusters
wisc.pr.hclust.clusters <- cutree(wisc.pr.hclust, k = 4)

# Compare to actual diagnoses
t <- table(wisc.pr.hclust.clusters, diagnosis)
t
##                        diagnosis
## wisc.pr.hclust.clusters   0   1
##                       1   5 113
##                       2 350  97
##                       3   2   0
##                       4   0   2
sum(apply(t, 1, min))
## [1] 102
# Compare to k-means and hierarchical
t <- table(wisc.hclust.clusters, diagnosis)
t
##                     diagnosis
## wisc.hclust.clusters   0   1
##                    1  12 165
##                    2   2   5
##                    3 343  40
##                    4   0   2
sum(apply(t, 1, min))
## [1] 54
t <- table(wisc.km$cluster, diagnosis)
t
##    diagnosis
##       0   1
##   1 343  37
##   2  14 175
sum(apply(t, 1, min))
## [1] 51
  • It looks like the 2-cluster k-means model does the best job here: 51 mismatched observations, vs. 54 for hierarchical clustering and 102 for clustering on the PCA results.
  • The whole purpose of this is to see whether the results of clustering could be useful in a supervised learning process.
    • I think it might be worth adding the k-means clusters as a feature in a model. Maybe. I could just try it with and without the cluster feature and see which predicts better, now that I know how to do that. : ) A sketch of that idea follows.
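A minimal sketch of that idea (my own extension, not a course exercise): fit a logistic regression on the first 7 PCs with and without the k-means cluster label, and compare the fits.

# Hypothetical extension: does the k-means cluster label add anything to
# a simple supervised model?
df <- data.frame(diagnosis  = factor(diagnosis),
                 wisc.pr$x[, 1:7],
                 km_cluster = factor(wisc.km$cluster))

fit_with    <- glm(diagnosis ~ ., data = df, family = binomial)
fit_without <- glm(diagnosis ~ . - km_cluster, data = df, family = binomial)

# Lower AIC suggests a better in-sample fit; a real comparison would use
# cross-validated prediction accuracy instead
AIC(fit_with, fit_without)

(glm may warn about fitted probabilities of 0 or 1 here, since the classes separate well; this is only a sketch of the idea.)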

Conclusion

  • This was a VERY high-level overview of hierarchical clustering, k-means clustering, and PCA
  • I think it covers some good concepts like model selection, interpretation, and scaling data
  • I did get a little intuition for how the algorithms work
  • I don’t feel like I know these topics well yet, or that I am ready to base business decisions on my knowledge of these modeling techniques.
  • It was a good way to get started, though, and I am now definitely ready to dig into these techniques more.